Data Import (short report)

Author

Josef Mana

Published

August 14, 2025

Code
# Load packages instrumental for the report:
library(tidyverse)
library(targets)

# Set-up targets variables:
upstore <- here::here(tar_config_get("store"))
tar_source(here::here("R"))
tar_load(raw_data)

In this report, I showcase data import procedures and basic data description for the AggressiveSclerosis project wherein the goal is to identify the most significant predictors among a set of reasonable candidate predictors of a more aggressive phenotype of Multiple Sclerosis (MS).

The whole process is documented in the project's GitHub repository: https://github.com/josefmana/AggressiveSclerosis.git.

Step 1: Define Aggressive MS

The primary outcome of this first report is the determine_aggressive_phenotype() function, which implements an algorithm that assigns the aggressive MS phenotype label to patients eligible for this study. The function is based on Expanded Disability Status Scale (EDSS) measurements taken at least half a year away from a relapse, and takes the following steps:

  1. Exclude patients with fewer than four EDSS measurements.
  2. Exclude patients with the Primary Progressive phenotype of MS or with missing phenotype information.
  3. Exclude patients with less than 10 years of disease duration.
  4. Label MS of patients with no \(EDSS \geq 6\) during the first 10 years as non-aggressive.
  5. Label MS of patients who did not sustain \(EDSS \geq 6\) for at least 6 months as non-aggressive.
  6. Label MS of patients who did not sustain \(EDSS \geq 6\) until the end of the observation period as non-aggressive.
  7. Label MS of the remaining patients as aggressive.

Moreover, the function prepares and prints a text summarising the process and, when eye_check = TRUE, plots the EDSS data of patients classified as having aggressive MS so that the classification can be checked by eye.
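To make the labelling steps above concrete, here is a minimal sketch on toy data. It is not the code of determine_aggressive_phenotype(); the helper classify_patient(), the column names and the exact reading of "sustained" (first EDSS ≥ 6 within 10 years, confirmed by a measurement at least 6 months later, and never dropping below 6 afterwards) are assumptions made for illustration only. It also assumes that EDSS values close to a relapse and patients with the Primary Progressive phenotype have already been removed.

```r
# Hypothetical helper (not from the project): classify one patient's EDSS series.
# `years` = time since disease onset in years, `edss` = EDSS score at that time.
classify_patient <- function(years, edss) {
  ord <- order(years)
  years <- years[ord]
  edss <- edss[ord]

  if (length(edss) < 4) return(NA_character_)  # fewer than four measurements: excluded
  if (max(years) < 10) return(NA_character_)   # less than 10 years of follow-up: excluded

  hit <- which(edss >= 6 & years <= 10)        # EDSS >= 6 within the first 10 years
  if (length(hit) == 0) return("non-aggressive")

  t0 <- years[hit[1]]                          # first time EDSS >= 6 was reached
  confirmed <- any(edss >= 6 & years >= t0 + 0.5)  # assumed reading of "sustained for 6 months"
  if (!confirmed) return("non-aggressive")

  if (!all(edss[years >= t0] >= 6)) return("non-aggressive")  # must hold until the last visit
  "aggressive"
}

# Toy EDSS records for five made-up patients
toy <- data.frame(
  id    = rep(c("A", "B", "C", "D", "E"), times = c(5, 4, 5, 3, 4)),
  years = c(1, 4, 7, 9, 12,  1, 3, 6, 11,  1, 5, 6, 8, 12,  1, 2, 3,  1, 3, 5, 8),
  edss  = c(2, 3, 6, 6.5, 7,  1, 2, 2.5, 3,  2, 6, 6, 4, 4,  2, 2, 6,  2, 6, 6, 6.5)
)

labels <- sapply(split(toy, toy$id), function(p) classify_patient(p$years, p$edss))
print(labels)
```

In this toy example, only patient A is labelled aggressive; B never reaches EDSS ≥ 6, C does not sustain it, and D and E are excluded (NA) for too few measurements and too short follow-up, respectively.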

This is the function in action:

Determining Aggressive MS Phenotype
data <- determine_aggressive_phenotype(
  demographics = raw_data$id,
  relapses = raw_data$relapses,
  edss = raw_data$edss,
  eye_check = TRUE
)
Dropping 359 out of 4838 patients with less than four EDSS measurements.
Dropping 235 out of 4479 patients with the primary progressive or missing phenotype.

Evaluating the first criterion - disease duration ≥ 10 years ... 
Dropping 888 out of 4244 remaining patients.

Evaluating the second criterion - EDSS ≥ 6 within the first 10 years of disease ...
Dropping 3220 out of 3356 remaining patients.

Evaluating the third criterion - EDSS ≥ 6 sustained for at least 6 months ...
Dropping 9 out of 136 remaining patients.

Evaluating the final criterion - EDSS ≥ 6 remaining until the end of observation period ...
Dropping 30 out of 127 remaining patients.

This has left 97 out of total 3356 eligible patients (2.89%)
being classified as suffering the aggressive form of MS.

Plots containing EDSS data of patients classified
as suffering the aggressive form are shown for control.

Data from 4838 patients were extracted from a local database. Out of these,
359 were excluded due to having less than four EDSS observations,
and 235 were excluded due to a diagnosis of primary progressive multiple
sclerosis or missing phenotype information. Out of the remaining 4244
patients, 3356 patients met the criterion of a minimum of 10 years observation
time, as defined by recorded EDSS scores. Of these, 3220 patients did
not meet the criterion of EDSS ≥ 6 within the first 10 years of disease.
Further 9 patients did not meet the criterion of sustaining EDSS ≥ 6
until the end of the observation period and 30 patients did not meet
the criterion of sustaining EDSS ≥ 6 longitudinally.
The final sample thus comprised 3356 patients, of whom 97 (2.89%) met
criteria for aggressive disease.
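
The counts reported in the log above can be checked with simple arithmetic. The following sketch only re-uses the numbers printed by the function; the vector names are mine and purely illustrative.

```r
# Number of patients at the start and dropped at each step (taken from the log above)
start <- 4838
dropped <- c(
  few_edss             = 359,
  ppms_or_missing      = 235,
  duration_lt_10y      = 888,
  no_edss6_in_10y      = 3220,
  not_sustained_6m     = 9,
  not_sustained_to_end = 30
)

# Patients remaining after each step
remaining <- start - cumsum(dropped)
print(remaining)

# Eligible patients are those left after the disease-duration criterion
eligible <- unname(remaining["duration_lt_10y"])
aggressive <- unname(remaining["not_sustained_to_end"])
pct <- round(100 * aggressive / eligible, 2)
print(pct)
```

The running totals (4479, 4244, 3356, 136, 127, 97) and the final proportion of 2.89% agree with the values printed by the function.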

Step 2: Preprocess Predictor Variables

The next step was to prepare the pool of predictor variables that will be used to help understand the risk associated with the aggressive form of MS. The following predictors were prepared:

  • Treatment variables:
    • Percentage of time spent using Platform (1st line) medication during the first 5 and 10 years of disease
    • Percentage of time spent using HET (2nd line) medication during the first 5 and 10 years of disease
    • Percentage of time spent using LE HET (2nd line) medication during the first 5 and 10 years of disease
    • Time to first medication (Platform, HET or LE HET)
  • Symptoms:
    • Number of relapses during the first 10 years of the disease
  • MRI variables:
    • T2 lesion volume at the closest point before years 2, 5 and 10 of the disease course
    • T1 black hole volume at the closest point before years 2, 5 and 10 of the disease course
    • Brain atrophy at the closest point before years 2, 5 and 10 of the disease course
    • T1 TBV at the closest point before years 2, 5 and 10 of the disease course
    • Corpus callosum volume at the closest point before years 2, 5 and 10 of the disease course
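
As an illustration of the treatment variables listed above, the sketch below computes the percentage of a 5- or 10-year window that is covered by treatment episodes. The helper pct_time_on() and the episode layout (start and end in years since disease onset, episodes not overlapping) are hypothetical and are not taken from the project code.

```r
# Hypothetical helper: percentage of the window [0, window_end] spent on treatment.
# `start` and `end` give treatment episodes in years since disease onset and are
# assumed not to overlap each other.
pct_time_on <- function(start, end, window_end) {
  overlap <- pmax(0, pmin(end, window_end) - pmax(start, 0))  # years of each episode inside the window
  100 * sum(overlap) / window_end
}

# Toy patient: on treatment from year 1 to 3 and from year 4 to 9
start <- c(1, 4)
end   <- c(3, 9)

pct_y5  <- pct_time_on(start, end, window_end = 5)
pct_y10 <- pct_time_on(start, end, window_end = 10)
print(c(Y5 = pct_y5, Y10 = pct_y10))
```

For this toy patient, 3 of the first 5 years and 7 of the first 10 years are covered, giving 60% and 70%.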

Preprocessing of these variables is implemented in the function preprocess_predictors().

pred_data <- preprocess_predictors(
  d0 = data$data,
  t = raw_data$treatment,
  r = raw_data$relapses,
  m = raw_data$mri,
  c = raw_data$csf,
  o = raw_data$ocb,
  chol = raw_data$cholesterol
)
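
To illustrate what "closest point before years 2, 5 and 10" could mean for the MRI variables, here is a small sketch on toy data. The helper closest_before() is hypothetical, and treating a scan taken exactly at the cut-off as eligible (t <= cutoff) is my assumption rather than something confirmed by preprocess_predictors().

```r
# Hypothetical helper: value of the last scan taken at or before `cutoff`.
# `t` = scan time in years since disease onset, `value` = the MRI measure.
closest_before <- function(t, value, cutoff) {
  ok <- which(t <= cutoff)
  if (length(ok) == 0) return(NA_real_)  # no scan before the cut-off
  value[ok[which.max(t[ok])]]
}

# Toy scans for one made-up patient
t     <- c(0.5, 1.8, 4.2, 7.5, 11)
value <- c(1.0, 1.4, 2.1, 3.0, 3.6)

res <- sapply(c(Y2 = 2, Y5 = 5, Y10 = 10), function(k) closest_before(t, value, k))
print(res)
```

Here the scans at years 1.8, 4.2 and 7.5 are selected for the year 2, 5 and 10 cut-offs; a patient without any scan before a cut-off gets a missing value for that time point.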
Treatment and MRI Distributions

Figure 1 and Figure 2 show the distributions of treatment and MRI variables separately for patients without and with a diagnosis of aggressive MS, together with the p-value from a Mann-Whitney test comparing these distributions.

Code
med_data <- pred_data |>
  select(id, aggressive, all_of(naming$var[naming$typ == "Medication"]))

for(i in which(naming$typ == "Medication")) {
  med_data[[naming$nam[i]]] <- med_data[[naming$var[i]]]
}

med_data <- med_data |>
  select(id, aggressive, all_of(naming$nam[naming$typ == "Medication"]))

med_data |>
  pivot_longer(-c(id, aggressive), names_to = c("Medication", "Year"), names_sep = "_", values_to = "Value") |>
  ggpubr::ggviolin(
    x = "aggressive", xlab = "Aggressive MS",
    y = "Value",
    facet.by = c("Medication", "Year"),
    add = "median",
    short.panel.labs = FALSE
  ) +
  ggpubr::stat_compare_means(vjust = -1, hjust = -0.8, label = "p.format")
Figure 1: Violin plots and Mann-Whitney test results comparing distributions of treatment variables between study groups
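
The reshaping step in the plotting code relies on display names that contain a single underscore separating the variable from the year. The sketch below shows this on toy data; the column names Platform_Y5, HET_Y10 and so on are assumptions inferred from the names_sep = "_" argument and the Y2/Y5/Y10 levels used in the report, not the actual contents of the naming table.

```r
library(tidyr)

# Toy wide table with names of the assumed form "<Medication>_<Year>"
toy <- data.frame(
  id = 1:2,
  aggressive = c(0, 1),
  Platform_Y5 = c(40, 0),
  Platform_Y10 = c(55, 10),
  HET_Y5 = c(0, 60),
  HET_Y10 = c(20, 80)
)

# Split each column name at "_" into Medication and Year, one row per value
long <- pivot_longer(
  toy,
  -c(id, aggressive),
  names_to = c("Medication", "Year"),
  names_sep = "_",
  values_to = "Value"
)
print(long)
```

Each patient contributes one row per medication and year, which is the layout that the facet.by = c("Medication", "Year") argument expects.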
Code
mri_data <- pred_data |>
  select(id, aggressive, all_of(naming$var[naming$typ == "MRI"]))

for(i in which(naming$typ == "MRI")) {
  mri_data[[naming$nam[i]]] <- mri_data[[naming$var[i]]]
}

mri_data <- mri_data |>
  select(id, aggressive, all_of(naming$nam[naming$typ == "MRI"]))

mri_data |>
  pivot_longer(-c(id, aggressive), names_to = c("Measure", "Year"), names_sep = "_", values_to = "Value") |>
  mutate(Year = factor(Year, levels = c("Y2", "Y5", "Y10"), ordered = TRUE)) |>
  ggpubr::ggboxplot(
    x = "aggressive", xlab = "Aggressive MS",
    y = "Value",
    facet.by = c("Measure", "Year"),
    scales = "free_y",
    repel = TRUE
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.25))  # 25% extra space on top
  ) +
  ggpubr::stat_compare_means(vjust = -1.2, label = "p.format")
Figure 2: Box plots and Mann-Whitney test results comparing distributions of MRI variables between study groups
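
For readers who want to reproduce a single p-value outside of the plots, the Mann-Whitney test is available in base R as wilcox.test() (the Wilcoxon rank-sum test), which, to my understanding, is also the default method of ggpubr::stat_compare_means() for two groups. The sketch below uses made-up values; the column names mirror those in the plotting code but the data are not from the study.

```r
# Toy data: one variable measured in patients without (0) and with (1) aggressive MS
toy <- data.frame(
  aggressive = factor(rep(c(0, 1), times = c(6, 5))),
  Value = c(10, 25, 30, 42, 55, 61,  5, 8, 12, 20, 28)
)

# Mann-Whitney (Wilcoxon rank-sum) test comparing the two groups
res <- wilcox.test(Value ~ aggressive, data = toy)
print(res$method)
print(res$p.value)
```

Because the test is rank-based, it does not assume normally distributed values, which suits skewed variables such as percentages of time on treatment.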